


DOCUMENT RESUME

ED 206 685    TM 810 583

AUTHOR       Mehrens, William A.

TITLE        Setting Standards for Minimum Competency Tests.

PUB DATE     24 Feb 81

NOTE         35p.; Revision of a speech presented at the Michigan
             School Testing Conference (Ann Arbor, MI, February
             24, 1981).

EDRS PRICE   MF01/PC02 Plus Postage.

DESCRIPTORS  Criterion Referenced Tests; *Cutting Scores;
             Elementary Secondary Education; *Minimum Competency
             Testing; *Scoring Formulas; *Standards

IDENTIFIERS  Angoff Methods; Compromise Model (Hofstee); Ebel
             Method; Empiricism; Jaeger Method; Nedelsky Method

ABSTRACT 

Some general questions about minimum competency tests
are discussed, and various methods of setting standards are reviewed
with major attention devoted to those methods used for dichotomizing
a continuum. Methods reviewed under the heading of Absolute Judgments
of Test Content include Nedelsky's, Angoff's, Ebel's, and Jaeger's.
These methods are compared and a preference for Jaeger's approach is
stated. Under Standards Based on Judgments about Groups, the Zieky
and Livingston contrasting group and borderline group methods are
discussed. The approaches proposed by Berk and Block are briefly
discussed as Empirical Methods for Discovering Standards. A summary
statement lists some "DO NOT'S" and "DO'S" for setting cutting
scores. (Author/GK)



* Reproductions supplied by EDRS are the best that can be made
* from the original document.






Setting Standards for Minimum Competency Tests*



William A. Mehrens
Michigan State University






*Revision of a speech given at the Michigan School Testing Conference, Ann Arbor,
Michigan, February 24, 1981. 



ABSTRACT 



First some general questions about minimum competency tests are 
discussed. Then various methods of setting standards are reviewed 
with major attention devoted to those methods used for dichotomizing 
a continuum. Methods reviewed under the heading of Absolute Judgments 
of Test Content include Nedelsky's, Angoff's, Ebel's, and Jaeger's.
These methods are compared and a preference for Jaeger's approach is
stated. Under Standards Based on Judgments about Groups, the Zieky
and Livingston contrasting group and borderline group methods are
discussed. The approaches proposed by Berk and Block are briefly
discussed as Empirical Methods for Discovering Standards. A summary
statement lists some "DO NOT'S" and "DO'S" for setting cutting scores.






Introduction 

A. History of Minimum Competency Tests 

As many others have pointed out before me (e.g., Ebel, 1978),
minimum competency testing has been around for a long time. A very
early minimal competency exam was given when the Gilead Guards challenged
the fugitives from Ephraim who tried to cross the Jordan River.

"Are you a member of the tribe of Ephraim?" they asked. If the
man replied that he was not, then they demanded, "Say Shibboleth."
But if he couldn't pronounce the "sh" and said Sibboleth instead of
Shibboleth he was dragged away and killed. So forty-two thousand
people of Ephraim died there at that time. (Judges 12:5-6, The
Living Bible).

Nothing is reported concerning the debates that may have gone on 
among the guards regarding what competencies to measure, how to 
measure them, when to measure, how to set the minimum standard, or 
indeed what should be done with the incompetent. We do not know the
ratio of false positives to false negatives or the relative costs of
the two types of errors. We do know that a very minimal competency
exam was given and that forty-two thousand people failed - with no
chance of a retake. And some people in Michigan think they have it bad!

But there have been other, less drastic competency exams - for 
example those for certifying or licensing professionals and those for 
obtaining a driver's license. 

If not a new concept, why so much fuss? Never before have state and
local agencies been so active in setting the minimum competency standards
for elementary and secondary students. At least 35 states have taken
some such type of action, and it has been reported (Pipho, 1978)
that all the remaining states either have legislation pending or
legislative or state board studies under way.

B. General questions about minimum competency tests 

Over the past several years a multitude of questions have been
raised about minimum competency testing. For example: (1) why have
them at all, (2) what competencies should be measured, (3) how
should we measure them, (4) when should we measure the competencies,
(5) who should set the minimum standard, (6) how should the minimum
standard be determined, (7) should there be one minimum or
many, and (8) what should be done with the incompetent? These
questions are all related. The answer given for one has implications
for the answers to the others. Thus, although my charge
today is to discuss question 6 - how should the standard be determined? -
it seems advisable to briefly mention my views on the
answers to the other questions. Further details regarding my
views on all these questions can be found in Mehrens (1979).

(1) Why have standards at all?

Why the big push for minimum competency tests with specified
standards? Many individuals believe the evidence suggests that
the quality of our children's education is deteriorating and
that minimum competency testing will improve educational quality 
(or reverse any deterioration). Both points are debatable. I 
believe the first - some of you may not. The second point is 
one where I would prefer to reserve judgment but, as mentioned, 
there is some supportive evidence reported in the literature. 






Of course there are many perceived costs as well as perceived
benefits of minimum competency testing. Perkins (in Gorth and
Perkins, 1979) has compiled the following two lists:

Perceived Costs of Minimum Competency Testing 

• emphasis on the practical will lead to an erosion of liberal
  education
• causes less attention to be paid to difficult-to-measure learning
  outcomes
• promotes teaching to the test
• will be the "deathknell for the inquiry approach to education"
• oversimplifies issues of defining competencies and standards and
  of granting credentials to students
• promotes confusion as to the meaning of the high school diploma
  when competency definition is left to local districts
• fails to adequately consider community disagreement over the
  nature and difficulty of competencies
• will exclude more children from schools and further stigmatize
  underachievers
• will cause "minimums" to become "maximums," thus failing to provide
  enough instructional challenge in school
• may unfairly label students and cause more of the "less able" to
  be retained
• may cause an increase in dropouts, depending on the minimum
  that is set
• provides no recognition of the "average" student
• fails to provide alternatives that can "inspire" average students
  to excel in some areas
• ignores the special needs of gifted students, giving them less
  opportunity to be challenged and to expand their horizons
• may have adverse impact on a student's future career as a result
  of a withheld diploma
• may promote bias against racial, ethnic, and/or special needs groups
• places the burden of "failure" on the student
• causes educators to be held unfairly accountable
• intensifies the conflict for educators between humaneness and
  accountability
• increases the record-keeping burden for administrators
• does not assure that students will receive effective remediation
• does not assure that all of the perceived needs and benefits will
  be met and realized
• promotes the power of the state at the expense of local district
  autonomy
• can be costly, especially where implementation and remediation are
  concerned

Perceived Benefits of Minimum Competency Testing 

• restores meaning to a high school diploma
• reestablishes public confidence in the schools
• impels us to face squarely the question of "what is a high school
  education?"
• sets meaningful standards for diploma award and grade promotion
• challenges the validity of using seat time and course credits as
  basis for certifying student accomplishments
• certifies that students have specific minimum competencies
• involves the public and local educators in defining educational
  standards and goals
• focuses the resources of a school district on a clear set of goals
• defines more precisely what skills must be taught and learned for
  students, parents, and teachers
• promotes carefully organized teaching and carefully designed
  sequential learning
• reemphasizes basic skills instruction
• helps promote competencies of life after school
• broadens educational alternatives and options
• motivates students to master basic reading, mathematics, and
  writing skills
• stimulates teachers and students to put forth their best efforts
• identifies students lacking basic skills at an early stage
• encourages that schools help those students who have the greatest
  educational need
• can bring about cohesiveness in teacher training
• encourages revision of courses to correct identified skill deficiencies
• can truly individualize instruction
• shifts priorities from process to product
• holds schools accountable for educational products
• furnishes information to the public about performance of
  educational institutions
• provides an opportunity to remedy the effects of discrimination
  by identifying learning problems early in the educational process
• provides greater holding power for students in the senior year
• provides for easier allocation of resources

Shepard (1980) bypasses the cost/benefit debate to discuss the
three primary uses of competency test scores: pupil diagnosis, 
pupil certification, and program evaluation. The various methods 
used to set standards are differentially appropriate for the 
various intended uses. 

(2) What competencies 

The answer to the question of what competencies should be 
measured in a minimal competency program is related directly to
the purposes of the test, i.e., what inferences we wish to make 
about a person who "passes", and much less directly about the 
"purposes of the school." Many people apparently do not make 
enough of this distinction. 

Although there exists a reasonable consensus about desirable 
adult characteristics, there is considerable diversity of opinion 
about their relative importance and the role of the school in 
promoting those characteristics. Some people maintain that good 
citizenship or healthy self-concepts are more important in life 
than reading skills. Others assert just the opposite. And some 
who believe the former do not believe it is the primary purpose 
of the school to promote those characteristics. I suspect we will
never reach agreement on what characteristics we "need" in our society and
on the role of the school in teaching, establishing, or nurturing those 
characteristics. That does not dismay me, nor do I believe it should deter 
us from determining general content for a minimal competency test. No test 
can be designed to assess the degree to which all the purposes of education 
have been achieved or even to assess whether students have achieved a level 
of minimal competency in all areas. 

Surely no one would infer that all purposes of education have been
achieved if students pass a minimum competency test. What will people infer
and/or what do we want people to be able to reasonably infer from a passing 
score on a minimum competency test? 

Would any reasonable citizen infer - or would we want them to infer -
that a passing score means the person has "survival skills" for life? Life
is very varied, and so are the skills needed to survive. I cannot believe
the populace is so unrealistic or naive as to think in such grandiose terms.
Schools do not and cannot teach all survival skills. Such skills cannot even
be adequately enumerated (or defined), and thus they cannot be adequately
measured. Since we do not want any "survival skills" inference to be drawn
from a test, we should not build a test to measure such defined competencies.

But if we measure only basic skills (applied to life settings), won't 
other areas of school suffer? I do not think so. Remember, there is a
distinction between the purpose of school and the purposes of a minimal
competency test. The purpose of the latter can never be to assess all the
objectives of school. We all know that. Of course not all skills are basic
and we do not want minimums to become maximums. Few would be happy to see
high school graduates who lacked maturity, self-discipline, and some
understanding of their own value systems. But if we keep in mind the limitations
of the inferences to be drawn from passing (or failing) a minimum competency
test, such limited testing should not have deleterious effects.

We should not assume that minimal competency standards can do
much at all to define the goals and objectives of education. They
only set a lower limit of acceptable standards in certain basic skill
areas. This certainly suggests that passing the minimal competency
test should not be the only requirement for high school graduation.
Other graduation requirements could assure breadth in other areas.

In specifying the domain of basic skills, we need to keep in mind
the relationship between the tested domain and what is taught in
school. We should not be testing content that is not taught. On the
other hand, we should not attempt to randomly sample all that is
taught. The tested domain must be a subset of materials taught in
the curriculum. The domain must be defined precisely enough to rule
out relatively unimportant specific bits of factual knowledge as well
as processes so abstract they appear to measure general intelligence.
There should be evidence not only that the material tested is actually
taught (i.e., presented) but that almost all students are capable of
mastering the materials.

3. How to Measure

There are a variety of possible meanings to the question "How
to Measure." How to sample, how to administer the measures, or how
to build the measuring instruments are all possible meanings. Most
people who have spoken to this question have addressed the latter
point - usually with respect to the type of measuring instrument.
Obviously, the choice depends somewhat on what competencies one wishes
to measure. Remember, I have voted for basic skills. I believe for
such competencies we can get along reasonably well with what we have
traditionally called objective paper-and-pencil tests, but the answer
must depend on the specific domain definitions of the competencies.
4. When to Measure

The answer to the question "When to Measure" (like the answer
to every other question) depends on the purpose or purposes of testing.
Of course the primary reason for minimal competency testing is to identify
students who have not achieved the minimum. But, identify for what
purpose? To help the students identified through remediation programs? To
motivate students through "fear of failure"? To make a high school diploma
more meaningful? Let me assume the answer is yes to all questions.

First, let me suggest that there should be periodic but not every-year
testing. I believe minimum competency programs will be more cost-effective
if tests are given approximately three times during the K-12
portion of a student's schooling, for example in grades 4, 7 and 10.

Teachers, of course, gather almost continuous data. They often have
already identified those students achieving inadequately. The formal
tests supplement the teachers' measures and confirm or disconfirm
previous judgments. I believe that this formal identification is useful.
Tests are credible instruments, help motivate students (and teachers),
and help assign a minimal competency meaning to a diploma or certificate.
I would like to stress, however, that while I favor tests it is NOT
because I believe that teachers' judgments without them would be grossly
inaccurate.

I am opposed to every-grade testing for minimal competencies
because it is not cost-effective. (I am not opposed to every-grade
testing with a more general achievement measure.) Only a very few
students, we hope, will be identified as not achieving at a minimal
level, and at any rate those identified in fourth grade would very
likely overlap considerably with those in third or fifth grade. Further,
annual testing may result in grade-by-grade promotion decisions based
on test results. In spite of the generally favorable press for this
approach in Greensville County, VA, I remain somewhat skeptical about
such a plan.

I selected early grades because I believe the evidence shows that
remediation is more effective when begun early. I believe we need high
school testing for two reasons: (1) not all students will "make it"
by seventh grade and (2) those who do may need to recheck their skills.
Forgetting does occur between grades seven and ten - especially if the
material is not part of the curriculum in the intervening period.

Finally, let me stress that if minimal competency tests are used
for high school certification or graduation, there must be opportunities
for students who have not passed to retake the exams. Further, no test
should be used for such a purpose the first year it is given. To be
fair to students there should be a phase-in period.

5. Who Sets the Minimums?

Obviously, the minimums must be determined by those who have
the authority to do so. This is an agency such as a state board of
education or a local school board. It is more difficult to decide
who should represent this agency. Of course all constituents should
be involved, but I firmly believe that measurement experts need to be
involved as well. Although setting the minimum is arbitrary, measurement
experts can have some useful suggestions. These should become
obvious as we discuss the various methods of setting the standards.



6. One Minimum or Many?

The answer, again, depends on purpose. For example, do we 
wish to categorize or diagnose? 

Assume for the moment that two basic skill areas have been
identified. If we are truly concerned with minimum performance in each
area, then we cannot use a single total score or any type of compensatory
scaling model across areas. Multiple cut-off scores are needed -
at least one for each basic skill area.
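
The difference between a compensatory total score and separate cut-offs can be sketched in a few lines of Python. The area names, scores, and cut-off values below are hypothetical, chosen only for illustration; they are not drawn from any actual testing program.

```python
# Hypothetical illustration: compensatory total score vs. multiple cut-offs.
# All names and numbers are assumptions made for this example.

def passes_compensatory(scores, total_cut):
    """Pass if the summed score across areas reaches one total cut-off."""
    return sum(scores.values()) >= total_cut


def passes_multiple_cutoffs(scores, cuts):
    """Pass only if every area reaches its own cut-off."""
    return all(scores[area] >= cut for area, cut in cuts.items())


student = {"reading": 38, "mathematics": 18}    # items correct in each area
area_cuts = {"reading": 25, "mathematics": 25}  # one cut-off per basic skill area
total_cut = 50                                  # single cut-off on the total score

compensatory_result = passes_compensatory(student, total_cut)
multiple_result = passes_multiple_cutoffs(student, area_cuts)
print(compensatory_result, multiple_result)
```

In this invented case the strong reading score offsets the weak mathematics score under the compensatory rule, so the student passes on the total but fails when each area has its own minimum.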

But what about the subskills (objectives) within an area?
Will a compensatory model work there, or do we need multiple cut offs?
If the latter, would we require a "pass" on every subskill objective
or only on a certain percentage of them? Of course the answer, and
the importance of the answer, will depend upon the covariance structure
of the item scores within and across objectives and objective scores
within the total test. If the test covers heterogeneous competencies -
each important - then a single cut off score would not be too meaningful.
Setting separate standards for each objective, however, results in another
problem - that of the reliability of the scores. The empirical evidence
I am aware of suggests that the covariances across objectives are
sufficiently high so that one can defend the use of a total score
within the broader basic skill. Of course if it seemed useful, one
could report out objective by objective and still have only one total
cut off score for the categorization decision. The usefulness of the
objective by objective information would be dependent upon the
reliabilities of each objective score as well as the reliabilities of the
difference scores.






Even if scores are reported objective by objective, the total
score should be based on the total number of items correct, not the
total number of objectives passed. Unless differential weights are
used for the objectives, the former method is the more reliable. To
dichotomize objective scores before combining them results in the
loss of information.
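
A small Python sketch shows what is lost when objective scores are dichotomized before they are combined. The two students, the four five-item objectives, and the "4 of 5 to pass" rule are all invented for the example.

```python
# Hypothetical illustration of the information lost by dichotomizing
# objective scores before combining them. All numbers are invented.

def total_items_correct(objective_scores):
    """Total score based on the number of items answered correctly."""
    return sum(objective_scores)


def objectives_passed(objective_scores, per_objective_cut):
    """Total score based on the number of objectives passed."""
    return sum(1 for s in objective_scores if s >= per_objective_cut)


PER_OBJECTIVE_CUT = 4       # assumed rule: 4 of 5 items passes an objective
student_a = [4, 4, 4, 0]    # items correct on each of four objectives
student_b = [5, 5, 3, 3]

summary = {
    name: (total_items_correct(s), objectives_passed(s, PER_OBJECTIVE_CUT))
    for name, s in {"A": student_a, "B": student_b}.items()
}
print(summary)
```

Student A answers 12 items correctly and student B answers 16, yet A "passes" three objectives and B only two. Counting objectives passed reverses the ordering given by items correct, which is one way to see the loss of information.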

Later in this paper I will address in detail one of the general 
questions raised at the beginning of this section - how should the 
minimum standard be determined. First let me mention some general 
points which anyone involved in setting cutting scores should consider. 

General points to keep in mind when setting the cutting score 

Before choosing any particular method of setting cutting scores,
the tester should consider the legal defensibility of the procedure,
the ease and cost of implementing the procedure, public acceptance
of both the procedure and the cutting score, its psychometric
characteristics, and political considerations (Nassif, in Gorth & Perkins,
1979).

Discussing the legal defensibility of decisions made on the basis
of minimum competency scores would require far more time than is
available today. Certainly legal questions concerning the authority
to make the decision and the reliability, validity, and potential
bias of the measuring device could be raised. But today the focus is
not on these issues but rather on whether the cutting score is set
appropriately. "Appropriateness" in licensing or certifying decisions
in industrial psychology would probably require empirical and/or
logical relationships between the cutting score and the minimum ability
required to do the job. In education "appropriately" probably means
following one of the established procedures.

Obviously some of the procedures to be discussed are much more
complicated than others. The choice of method is affected by the
availability of money, time, and technical expertise.

Also, the method chosen should provide reliable, valid, and unbiased
results. One should pay attention to the false positive and
false negative rates and the relative costs of those two types of
errors. The desirability of public acceptance is increasingly
important. Three primary factors influencing public acceptance are the
ease with which the process can be understood, the involvement of the
public in the process used, and the proportion of the test takers who
fail to reach the cut off score. Although political considerations
will not be discussed here, anyone responsible for setting cutting
scores should become aware of the political climate of the district
or state and the implications of this climate for setting cutting
scores.

III. Methods of Setting Standards
A. Introduction

First off I would like to admit, like others before me, that
the actual choice of a minimum is arbitrary. Different methods of
setting the minimum lead to different cutoff scores, and one cannot
say in the abstract that one method (or one cutoff score) is superior
to another. Gene Glass makes the point as follows:

"I have read the writings of those who claim the ability to make
the determination of mastery or competency in statistical or
psychological ways. They can't. At least, they cannot determine
"criterion levels" or standards other than arbitrarily ... the
language of performance standards is pseudoquantification, a
meaningless application of numbers to a question not prepared for
quantitative analysis." (Glass, 1978a, p. 602)
So, admittedly, setting the standard is arbitrary. Further, it 
is politically and economically influenced. If the standards are too 
high and too many students fail, then there will surely be a public 
outcry about the quality of the schools and the unreasonableness of 
the standards. Further, if one is committed to remediation, the costs
of remediation could be very high. If the standards are set too low
then the program becomes meaningless, and if people become aware of 
the ridiculously low standards, they will again present an outcry 
about the quality of the schools. The standard setters will be damned 
either way. 

Glass raises the question of whether a criterion-referenced
testing procedure entailing mastery levels is appropriate. He answers 
in the negative stating that "nothing may be safer than an arbitrary 
something." (Glass, 1978b, p. 258) 

Now I certainly admire Gene Glass as a person and I agree with 
much of what he has said in the two articles I have referenced in 
this section. And indeed, we might be "safer" with nothing rather 
than an arbitrary something. But let me for the moment take the other 
side. 

There is no question but that we make categorical decisions in
life. If some students graduate from high school and others do not, a
categorical decision has been made whether or not one uses a minimal
competency exam. Even if everyone graduates, it is still a categorical
decision if the philosophical or practical possibility of failure
exists. If one can conceptualize performance so poor the performer
should not graduate, then theoretically a cutoff score exists. The
proponents of minimal competency exams seem to believe, at least
philosophically, that there is a level of incompetence too low to
tolerate, and that they ought to define that level so it is less
abstract, less subjective, and perhaps a little less arbitrary than
the way decisions are currently made.

The above is not an argument for using minimal competency tests
alone as graduation requirements. Nor is it an argument for using a
dichotomous (as opposed to continuous) test score as one of the factors
in that decision. What I am trying to make very clear is that ultimately -
after combining data in some fashion - a dichotomous categorization
exists: those who receive a diploma and those who do not. No matter what
type of equation is used, linear or nonlinear, no matter what variables
go into the equation, no matter what coefficients precede their values,
the final decision is dichotomous and arbitrary. The argument against
minimal competency exams cannot be that they lead to an arbitrary
decision unless one truly believes that all individuals - no matter
what their level of performance - belong in the same category.

If someone has decided to set an observed minimal test score, 
how should it be done? Theoretically this is no problem. Decision theory 
spells out exactly how to proceed. First, determine the "true" mastery 
level cutting score and the cost of false positives and false negatives. 
Then some simple mathematics will show where to set the observed cutting 
score (or, more precisely, how to allocate individuals to mastery states) 
such that the total cost of errors will be minimized. Of course we do 
not know what values to give to "true" mastery level or to the cost of 
the false positives and false negatives! 
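
The "simple mathematics" can be sketched in Python under strong assumptions. The sketch pretends that each examinee's "true" mastery state and the two error costs are known, which is exactly what is not known in practice; the data, the costs, and the function name `best_observed_cut` are hypothetical and serve only to show the cost-minimizing logic.

```python
# A minimal sketch of choosing an observed cutting score that minimizes the
# total cost of classification errors. It ASSUMES the "true" mastery state of
# each examinee and both error costs are known; all values are invented.

def best_observed_cut(examinees, cost_false_positive, cost_false_negative, max_score):
    """Return (cut, total_cost) with the lowest total error cost.

    examinees: list of (observed_score, is_true_master) pairs.
    An examinee passes when observed_score >= cut.
    Ties are resolved in favor of the lowest cut.
    """
    best = None
    for cut in range(max_score + 2):  # cut = max_score + 1 fails everyone
        false_pos = sum(1 for s, m in examinees if s >= cut and not m)
        false_neg = sum(1 for s, m in examinees if s < cut and m)
        cost = cost_false_positive * false_pos + cost_false_negative * false_neg
        if best is None or cost < best[1]:
            best = (cut, cost)
    return best


examinees = [
    (3, False), (4, False), (5, False), (5, True), (6, False),
    (6, True), (7, True), (8, True), (9, True),
]

# Failing a true master is assumed three times as costly as passing a non-master.
lenient = best_observed_cut(examinees, 1, 3, max_score=10)
# The reverse weighting: passing a non-master is the more costly error.
strict = best_observed_cut(examinees, 3, 1, max_score=10)
print(lenient, strict)
```

With these invented numbers the cost-minimizing cut moves from 5 to 7 when the relative costs of the two errors are reversed, so the "optimal" score depends entirely on values that must themselves be judged.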

Practically, there are many different ways that have been suggested.
These are thoroughly discussed in readily available literature, and
readers wishing a more thorough presentation should check Millman
(1974), Glass (1978b), volume 15, #4 of the Journal of Educational
Measurement (Winter 1978), Hambleton (Ch. 4 in Berk, 1980), Nassif (Ch.
4 in Gorth & Perkins, 1979), and Shepard (1980).



The various authors mentioned above categorize the methods
differently. For example, Hambleton talks about judgmental, empirical,
and combination methods. Glass categorizes the methods as 1) performance
of others, 2) counting backwards from 100%, 3) boot strapping on
other criterion scores, 4) judging minimal competence, 5) decision-theoretic
approaches, and 6) operations research methods. Shepard
has two major categories: methods which assume mastery is an all-or-none
state, and methods for dichotomizing a continuum. She provides
five subcategories for the second class. In this presentation I will
basically follow her outline but precede her categories with one
which Nassif calls "administrative decision or consensus."
B. Administrative decision or consensus

Nassif points out that administrative methods cannot be
classified as using either judgmental or statistical assumptions because
they have little structure. These methods, however, are very
commonly employed. "Setting standards by administrative decision
means simply that the cut off score is determined by one or more
persons holding a position of authority" (Nassif, in Gorth & Perkins,
1979, p. 105). This may or may not be an informed decision; it may
or may not be based on any data. It is an easy method to use and,
if the person making the decision actually has the authority to do so,
it permits some legal defense; but a good prosecuting attorney would
make this process seem pretty inadequate. It is not a method which will
necessarily win public acceptance, although the person setting the score
may be quite sensitive to public, financial, and political concerns.
There is no reason to believe such a method will lead to a cutoff score
with appropriate psychometric characteristics.

The consensus method is similar except that the decision is made
by a group of people who either have or have been allocated the decision
making power. If no specific method is used by the group, this procedure
has the same advantages and limitations as the administrator decision
making approach. If some specific methodological procedure is followed,
we would classify the procedure as other than simply "consensus".

C. Methods which assume mastery is an all-or-none state
(counting backwards from 100%)

Some standard setting models (state models) assume that mastery
is an all-or-none affair: an examinee either has the skill or does
not. If a person is a master he/she should be able to get all the items
correct except for those missed due to measurement errors. Thus, the
standard setting task involves a question of how much to adjust the
100% standard downward.

"Just how great a concession is to be made becomes distressingly
arbitrary, with some allowing a 5% shortfall and others allowing 20%
or more." (Glass, 1978b, p. 244)

Advocates of such a procedure usually ignore the fact that items 
measuring a specific objective may vary greatly in difficulty. Since
I (and most others whose writings I have read) believe the all-or-none 
assumption is not very plausible, these methods will not be considered 
further. 




D. Methods for Dichotomizing a continuum 

In continuum models the characteristic being assessed is assumed 
to be continuous. The cut off score is chosen such that it is the 
least amount a person can score and still be considered a master. 
"All of the methods proposed to formalize the selection of this cut 
off point are decision strategies to help in thinking about what amount 
of knowledge should be required" (Shepard, 1980, X..451). 
1. Absolute Judgments of Test Content 

Criterion referenced testing typically results in absolute 
rather than relative interpretations. Thus, to many people 
it seems reasonable to simply inspect the test content and
to decide what percentage of correct answers indicates mastery.
We will briefly consider four such methods: Nedelsky,
Ebel, Angoff, and Jaeger.

a. Nedelsky's approach

The Nedelsky (1954) approach is the oldest of the procedures
and has been used considerably in the health professions,
which is the area for which the procedure was developed. It
can only be used for multiple-choice questions with right answers.
Basically, the Nedelsky procedure involves asking each of
a set of judges to look at each item and identify the incorrect
options that a minimally competent individual would know were
wrong. Then, for each judge, the probability of a minimally
competent student getting an item correct would be the reciprocal
of the remaining number of responses (e.g., if on a 5-alternative
item a judge feels a minimally competent student
could eliminate 2 options, then the probability of such a person
getting the item correct is 1/3). The expected score on the
test for a minimally competent student would be the sum of the
obtained reciprocals across all items. Of course not all judges
will come up with the same score, so the total set of minimally
competent scores for the judges are averaged ($\bar{X}$). According
to Nedelsky, the standard deviation of the judges' scores would
be equal to the standard deviation of the scores of minimally
competent students. Thus, this standard deviation ($\sigma$) could be
multiplied by a constant K (decided by the judges, or test users)
to regulate the percent of minimally competent who pass or fail.
Thus the final cut off score is:

$$\text{C.S.} = \bar{X} + K\sigma$$

Assuming an underlying normal distribution, if one wishes 50%
of borderline examinees to fail one sets K = 0, if one wishes
84% to fail one sets K = 1, if one wishes 16% to fail one sets
K = -1, etc.
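
The Nedelsky arithmetic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the 4-item test, and the three judges' ratings are hypothetical, and the population standard deviation is used because the text does not say which form Nedelsky intended.

```python
from statistics import NormalDist, mean, pstdev


def judge_expected_score(options_per_item, eliminated_per_item):
    """Sum over items of 1 / (options a borderline examinee cannot rule out)."""
    total = 0.0
    for n_options, n_eliminated in zip(options_per_item, eliminated_per_item):
        remaining = n_options - n_eliminated
        if remaining < 1:
            raise ValueError("the correct option can never be eliminated")
        total += 1.0 / remaining
    return total


def nedelsky_cut_score(judge_scores, k=0.0):
    """C.S. = mean of the judges' scores + K * their standard deviation."""
    # Assumption: population SD; the source does not specify the estimator.
    return mean(judge_scores) + k * pstdev(judge_scores)


# Hypothetical data: a 4-item test, 5 options per item, rated by 3 judges.
options = [5, 5, 5, 5]
eliminated = [
    [2, 3, 1, 4],  # judge 1: options a borderline examinee could rule out
    [2, 2, 1, 3],  # judge 2
    [3, 3, 2, 4],  # judge 3
]
judge_scores = [judge_expected_score(options, e) for e in eliminated]
cut_k0 = nedelsky_cut_score(judge_scores, k=0.0)
cut_k1 = nedelsky_cut_score(judge_scores, k=1.0)

# Under the normal assumption, the share of borderline examinees who fail
# is the standard normal CDF evaluated at K.
share_failing_at_k1 = NormalDist().cdf(1.0)
```

With K = 0 the cut score is simply the judges' mean; raising K moves the cut score up by that many standard deviations and fails a larger share of borderline examinees.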

b. Angoff and modified Angoff approaches 

The Angoff method is similar to Nedelsky's, only the judges
are not asked to delete options but just to estimate the probability
that a minimally acceptable person would get each item
right. The sum of the probabilities becomes the cut off score.
ETS has simplified this procedure somewhat by providing a
seven point scale on which percentages of minimally knowledgeable
examinees who would get the items right are fixed (5, 20, 40, 75,
90, 95, Do Not Know) and asking judges to mark this scale.
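
A minimal sketch of the Angoff computation follows. The ratings are hypothetical, and averaging the judges' sums is an assumption carried over from the Nedelsky procedure; the text itself says only that the sum of the probabilities becomes the cut off score. "Do Not Know" responses are not handled here.

```python
from statistics import mean


def angoff_cut_score(ratings_by_judge):
    """Average, over judges, of each judge's summed item probabilities."""
    for ratings in ratings_by_judge:
        if any(not 0.0 <= p <= 1.0 for p in ratings):
            raise ValueError("each rating must be a probability between 0 and 1")
    judge_sums = [sum(ratings) for ratings in ratings_by_judge]
    # Assumption: judges are combined by a simple mean.
    return mean(judge_sums)


# Hypothetical ratings for a 4-item test from two judges, using values
# from the fixed ETS scale expressed as proportions.
ratings = [
    [0.90, 0.75, 0.40, 0.20],  # judge 1
    [0.95, 0.75, 0.20, 0.40],  # judge 2
]
cut = angoff_cut_score(ratings)
```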

c. Ebel's approach

In Ebel's (1972) approach the judges are asked to rate the
items on the basis of relevance (4 levels) and difficulty (3
levels). These categories form a 4 x 3 grid. Each judge is
asked to assign each item to the proper cell in the grid and
also, once that is done, to assign to the items in each cell
a percentage correct that the minimally qualified person
should be able to answer. (This percentage may be agreed on
by the judges via some process, or one could proceed with each
judge's values and average at the final stage.) Then the number
of questions in each cell is multiplied by the percentage
to obtain a minimum number of questions per cell. These numbers
are added across the 12 cells to get the total number of questions
the minimally qualified person should be able to answer.
Example for one judge

                                  Difficulty Level
                   Easy               Medium              Hard           Summed
Relevance     # Items % Correct  # Items % Correct  # Items % Correct  Number x %

Essential        40     100%        15      80%        10      30%        55
Important         5      90%        10      70%        10      20%        13.5
Acceptable        5      90%         5      40%         0      10%         6.5
Questionable      0      70%         0      50%         0       0%         0

                                                        Cutting Score = 75
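
The one-judge example above can be checked with a short Python sketch. The item counts and percentages are taken from the example; the dictionary layout and function names are illustrative assumptions.

```python
# (relevance, difficulty) -> (number of items, percent correct expected
# of the minimally qualified person), copied from the one-judge example.
grid = {
    ("Essential", "Easy"): (40, 100),
    ("Essential", "Medium"): (15, 80),
    ("Essential", "Hard"): (10, 30),
    ("Important", "Easy"): (5, 90),
    ("Important", "Medium"): (10, 70),
    ("Important", "Hard"): (10, 20),
    ("Acceptable", "Easy"): (5, 90),
    ("Acceptable", "Medium"): (5, 40),
    ("Acceptable", "Hard"): (0, 10),
    ("Questionable", "Easy"): (0, 70),
    ("Questionable", "Medium"): (0, 50),
    ("Questionable", "Hard"): (0, 0),
}


def row_total(grid, relevance):
    """Expected number correct within one relevance level."""
    return sum(n * pct / 100 for (rel, _), (n, pct) in grid.items() if rel == relevance)


def ebel_cut_score(grid):
    """Sum of (items x percent correct) over all 12 cells."""
    return sum(n * pct / 100 for n, pct in grid.values())


essential = row_total(grid, "Essential")
important = row_total(grid, "Important")
acceptable = row_total(grid, "Acceptable")
cut = ebel_cut_score(grid)
```

The row totals reproduce the "Summed" column (55, 13.5, 6.5, 0), and their sum gives the cutting score of 75.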



d. Jaeger's approach 

This approach is primarily judgmental but does use some
normative information, so others may place this process in some
other category. In one specific example (Jaeger, 1978) 700
people were divided into 14 groups of 50 each. In each group
everyone took the test and then answered two questions on each
item:

1) Should every high school graduate be able to answer
this item correctly?

2) If a student does not answer this item correctly, should
s/he be denied a high school diploma?

After each judge finishes this they receive the overall results
of the survey and their test performance. Then they are asked to
review and revise their standards. Next they are told the
proportion of students who would have failed based on the recommended
cut off score and asked to reconsider their ratings and
make a final judgment regarding the necessity of passing each
item on the test. Finally a median score is calculated for each
group and the cutting score is set at the lowest median cutting
score given by the groups.
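
The final aggregation step can be sketched as follows. The group names and scores are hypothetical, and treating each judge's recommended cut score as the number of items he or she finally marks as necessary is an assumption; the text specifies only the median within each group and the lowest median across groups.

```python
from statistics import median


def judge_cut_score(item_is_necessary):
    """Assumed rule: a judge's cut score is the count of items marked necessary."""
    return sum(1 for necessary in item_is_necessary if necessary)


def jaeger_cut_score(scores_by_group):
    """Return (lowest group median, dict of medians by group)."""
    medians = {group: median(scores) for group, scores in scores_by_group.items()}
    return min(medians.values()), medians


# Hypothetical final cut scores from three small groups of judges.
scores_by_group = {
    "group_1": [30, 32, 35, 28, 31],
    "group_2": [27, 29, 33, 30, 26],
    "group_3": [34, 36, 31, 33, 35],
}
cut, medians = jaeger_cut_score(scores_by_group)
```

Taking the lowest of the group medians is the lenient choice: no group's typical judge would have set the standard any lower.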

e. Comparison of above methods

There is no question but that different methods produce different
cutting scores (Andrew & Hecht, 1976; Brennan & Lockwood,
1980; Kleinke, 1980; Skakun & Kling, 1980). For example, in the
Skakun & Kling (1980) study the Nedelsky method resulted in a 23%
failure rate and the Ebel method a 46% failure rate.

There is no compelling theoretical reason to prefer one of
the above methods to any other. Most writers seem to prefer
Angoff for its simplicity. I prefer Jaeger's approach in that it
provides normative data but, as mentioned, it therefore may not
belong in this category of methods.

f. Problems and considerations of these methods

1. They do not agree with each other.

2. There is considerable disagreement among judges within
a method, and the averages obscure this.

3. It is difficult to build a theoretical rationale for any
of these models.

4. Such issues as what value to set for K in the Nedelsky
approach and how many cells to use in the Ebel approach
allow for considerable differences within variations of
any one approach.

5. Standards are often set quite high under these approaches
and thus many people fail.

2. Standards based on judgments about groups 

As Shepard points out, judgments based on test content alone can
result in standards that are obviously wrong. Sometimes individuals
fail such tests when other evidence of their mastery is more compelling
than the belief in the accuracy of the standard. In an attempt to avoid
such situations some people advocate setting the standard by looking at the
performance of individuals in an identified group.

a. Zieky & Livingston: Contrasting Groups 

In this approach, judges (teachers perhaps) are asked to pick 
individuals that clearly belong to one of two groups of examinees 
(using available information other than the test): one group
composed of individuals who are clearly masters and another
composed of individuals who are clearly nonmasters. The test is
then given to both groups, the distributions are plotted and an
initial standard is set at the intersection point of the two plots.
Then, if judgments are available about the relative costs of
false positives and false negatives the cutting score can be
raised or lowered to minimize the total cost of the misclassifications.
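
The cost-weighting step can be illustrated with a small Python sketch. Note the substitution: rather than reading the intersection off plotted distributions, this sketch searches the observed scores for the cut that minimizes total weighted misclassification cost, which with equal costs plays the role of the initial standard. The scores, cost values, and function name are hypothetical.

```python
def contrasting_groups_cut(master_scores, nonmaster_scores,
                           cost_false_pos=1.0, cost_false_neg=1.0):
    """Return (cut score, total cost); an examinee passes if score >= cut."""
    candidates = sorted(set(master_scores) | set(nonmaster_scores))
    best_cut, best_cost = None, float("inf")
    for cut in candidates:
        false_neg = sum(1 for s in master_scores if s < cut)      # masters who fail
        false_pos = sum(1 for s in nonmaster_scores if s >= cut)  # nonmasters who pass
        cost = cost_false_neg * false_neg + cost_false_pos * false_pos
        if cost < best_cost:  # ties keep the lower cut score
            best_cut, best_cost = cut, cost
    return best_cut, best_cost


# Hypothetical test scores for the two judged groups.
masters = [13, 15, 16, 17, 18, 18, 19, 20]
nonmasters = [8, 9, 10, 11, 12, 13, 14, 14]

equal_costs = contrasting_groups_cut(masters, nonmasters)
# Treating a failed master as five times as costly pushes the cut score down.
lenient = contrasting_groups_cut(masters, nonmasters, cost_false_neg=5.0)
```

With equal costs the cut falls at 15, misclassifying one master. When false negatives are weighted five to one, the cut drops to 13, accepting three false positives to avoid failing any master.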



b. Koffler

Koffler (1980) uses a quadratic discriminant function to set the
cutting score; otherwise the approach is the same as the Contrasting
Groups approach.

c. Zieky & Livingston: Borderline Group method

This method is similar to their Contrasting Groups method except
the judges are to choose individuals who they believe are borderline
with respect to minimal competency. This group is given the test and
the standard is set at the median. (Of course, one could choose to
pass some other percentage of minimally competent individuals. This
would be analogous to setting a K value in the Nedelsky approach.)
This approach is generally considered to be inferior to the Contrasting
Groups approach because it is more difficult to identify an adequate
sample of borderline examinees.
3. The Use of Norms 

In some of the previous methods discussed, empirical data gathering
techniques were used to help set standards but the standard was not based
on a conscious decision to fail any given percent of the total set of
individuals. To choose a cutoff score by a normative approach seems,
to some, to be contradictory to the purpose of criterion referenced testing.
But even Popham now admits we should norm our criterion referenced
tests (Popham, 1976). As Shepard has pointed out (and others before
her), "... it is only the first use of criterion-referenced tests, estimating
domain scores, that can be accomplished without relative comparisons.
Qualitative judgments about the excellence or adequacy of performance
depend implicitly on how others did on the test. Expectations about
what a lawyer or high-school graduate should know are normative. If
everyone could intuit the theory of relativity on their way to work,
Einstein would not have been considered a genius" (Shepard, 1980, p.
456).

Certainly cut off scores set without any normative data can be
very embarrassing. Rentz (1980) tells how a Georgia teacher certification
examination had a cutting score three standard errors below what
was considered the very least one should know, but when the test was
given too few passed it, so a new cutting score was set that was consistent
with a desired pass rate.

Whether or not one should use only a normative group and a desired
pass/failure ratio is of course debatable. But leading writers now seem
to agree that at least normative data could well be helpful to decision
makers when used in conjunction with some other method.

4. Hofstee's Compromise Model

Hofstee (1980) has proposed a compromise model in which judges are
asked to specify the following values.

1. The maximum required percentage of mastery: K-max. This is
the cut off score which would be satisfactorily high even if every
student scored that high or higher.

2. The minimum acceptable percentage of mastery: K-min. This is
the cut off score which is as low as one would go even if no student
attained that score.

3. The maximum acceptable percentage of failures: F-max.

4. The minimum acceptable percentage of failures: F-min.

Hofstee then graphs the two dimensions (test score and percent passing)
and uses a formula for arriving at a midpoint between (F-min, K-max) and
(F-max, K-min).
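
Hofstee's formula is not given here, so the following Python sketch is not his computation. It assumes one graphical reading of the compromise: draw the straight line from (K-min, F-max) to (K-max, F-min) and choose the cut score whose observed failure rate lies closest to that line. It works in percent failing rather than percent passing because the judges' F values are failure rates. All data and names are hypothetical.

```python
def fail_rate(scores, cut):
    """Percent of examinees scoring below the cut."""
    return 100.0 * sum(1 for s in scores if s < cut) / len(scores)


def hofstee_cut(scores, k_min, k_max, f_min, f_max):
    """Cut score in [k_min, k_max] whose observed fail rate is nearest the judges' line."""
    if k_max <= k_min:
        raise ValueError("k_max must exceed k_min")
    best_cut, best_gap = None, float("inf")
    for cut in range(k_min, k_max + 1):
        # Fail rate the judges' line tolerates at this cut:
        # f_max at k_min, falling linearly to f_min at k_max.
        allowed = f_max - (f_max - f_min) * (cut - k_min) / (k_max - k_min)
        gap = abs(fail_rate(scores, cut) - allowed)
        if gap < best_gap:
            best_cut, best_gap = cut, gap
    return best_cut


# Hypothetical percent-correct scores for 20 examinees.
scores = [42, 48, 51, 55, 58, 60, 62, 64, 65, 67,
          69, 70, 72, 74, 76, 78, 81, 84, 88, 93]
cut = hofstee_cut(scores, k_min=55, k_max=75, f_min=10, f_max=50)
```

In this example a cut of 63 fails 35% of examinees, one point from the 34% the judges' line tolerates at that score, so the judged limits and the observed results meet there.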

5. Empirical methods for discovering standards 

a. Berk's instructed and uninstructed groups

This approach is very similar to the Contrasting Groups method
of Zieky and Livingston. The distinction is that one does not use
judgment to determine who goes in which group. Rather the two groups
are determined as those who have been instructed and those who have
not. As with the contrasting groups procedure one can either set the
standard to minimize the total number of errors or one can differentially
weigh the false positives and false negatives. Berk's
procedure is most appropriate for instructional decision making. As
Shepard (1980) has pointed out, this procedure will not work for
high school minimum competency testing because a) one cannot identify
instructed and uninstructed groups and b) the assumption that
the instructed group will be predominantly masters is not necessarily
valid.

b. Block's educational consequences
(Glass' operations research method)

In this method one attempts to set the cutting score to maximize
future learning or other cognitive or affective criteria. The
question is "What passing score maximizes educational benefits?"
This method assumes there is some "functional relationship between
performance on the test and level of performance on the criterion
variable" (Shepard, 1980, p. 5A9). Actually, I know of only
one study that has used this approach (Block, 1972). It, like
the Berk method, is appropriate only for instructional decision
making, not for certification decisions. There are several problems
in Block's approach (Glass, 1978b; Hambleton & Eignor, 1979)
and it is not one I would recommend.

6. Empirical methods for adjusting standards 
(Glass' decision-theoretic category)

The methods classified under this approach use decision theory
and attempt to set cutting scores to ensure a minimum cost of the
errors. These methods are different from those that determine standards
because they presume a standard already exists on an external
criterion and the various methods translate this external standard
into a cut off score on the test. This means that someone has already
had to make some decision with respect to a standard on the
criterion. Obviously if an external criterion does not exist,
these approaches cannot be used. For that reason they are not likely
to be useful in minimum competency testing programs for high school
graduation since there is no standard of "adult success".

Since these approaches are not likely to be useful and since they
are fairly technical we will not review them here. Those of you
interested could check Huynh (1976), Livingston (1975), Novick and Lindley
(1978) and van der Linden and Mellenbergh (1977).

IV. How to choose a standard-setting method

A. Factors to consider

Generally in selecting a method, one would keep in mind the
points discussed earlier, such as legal defensibility, ease of
implementation, financial factors and public acceptance. One
should also consider the importance of the decision, the qualifications
of the judges (since some methods require more knowledgeable
judges than others), and the appropriateness of the method
for the type of decision (Hambleton, in Berk 1980).

B. Uses of Data

1. Pupil diagnosis

As Shepard (1980) pointed out, classroom passing scores are
usually set informally because teachers do not have the knowledge
or resources to use the more elaborate methods. Classroom errors
in classification, moreover, are not so costly. The best advice
to give teachers is to keep in mind the relative costs of advancing
someone who should be retained versus retaining someone who should
be promoted.

2. Pupil certification 

Shepard has made such a good statement about this that I wish
to quote her extensively:

"At a minimum, standard-setting procedures should include a balancing
of absolute judgments and direct attention to passing rates. All of
the embarrassments of faulty standards that have ever been cited are
attributable to ignoring one or the other of these two sources of
information. If absolute judgments are ignored, incompetent doctors
could pass the test if they were members of a weak class. High
school seniors are sometimes graduated without basic skills because
this is the norm. Since criterion-referenced testing was developed to
overcome the problems of relative judgments, this error is not usually
made with criterion-referenced tests. Instead, out of loyalty to absolute
standards, examining boards have made the opposite error of
setting standards without norms that fail half the medical school
class or that fail to fail any high school graduates in an entire
state. Direct attention to passing rates will allow standard setters
to reconcile their beliefs about the required competencies (items on
the test) and their beliefs about how many individuals are qualified."
(Shepard, 1980, p. 463)

The Angoff and Jaeger methods are generally considered the most 
practical approaches of judging test content. If qualified judges of 
people exist, the Zieky & Livingston contrasting groups method appears
most useful. The empirical methods for discovering or adjusting standards 
are useful only to the extent that they call attention to the relative 
costs of the two types of errors. 

3. Program evaluation 

Standards impose an artificial dichotomy on data, and thus much
information is lost about performance along the continuum in question. 
Shepard (1980, p. 468) states what many of us have believed and said 
for years: "Standards should not be used to interpret test data regard- 
ing the worth of educational programs." 

V. Current Practices

Procedures Used in Setting Standards*

Procedure                              State   Local

Administrative Decision                  5       6
Contrasting Groups                       2       3
Nedelsky/Angoff                          1       2
Competency Definition                    3       2
Field Test Results and/or
  Other Statistical Procedures           9       7

*From National Evaluation Systems, 1979. The reader is referred
to this report for additional information.



VI. Summary 

This presentation has covered a lot of material. Rather than review
it here, I shall simply present a list of "DO NOTS" and "DOs" for setting
cutting scores.

DO NOTS

1. Do not set cutting scores before building the items.

2. Do not set cutting scores before gathering some empirical evidence 
on item difficulty from an appropriate sample of students. 

3. Do not set cutting scores without some empirical evidence regard- 
ing the teachability of the material and the educational costs 
associated with the instruction. 

4. Do not set cutting scores without explicit consideration by 
representatives of parents and educators of the relative costs of 
false positives and false negatives. 

5. Do not set cutting scores which take effect the first year the 
test is administered. 

6. Do not conclude that more remediation is needed in one basic
skill than another based on the different proportions of "pass" 
scores on two non-equated tests. 

7. Do not suggest to the public that evidence of minimum performance
is sufficient. (Porter (1978) published a news release regarding
the proportion of students who received "acceptable" scores in
Michigan.)

8. Do not assume that one cannot or should not report scores in a
more continuous fashion even if some arbitrary cut off point has been
established.



DOs 

1. Do consider using more than test information for making important 
decisions. If test scores are combined with other data (in a 
multiple-regression sense) consider using the obtained raw score 
(or continuous scaled score transformation) rather than the
artificially dichotomized value.

2. Do remember that cutting scores can and probably do change over time. 

I first presented the above list at the Twelfth National Symposium for 
Professionals in Evaluation and Research in Cincinnati on October 17, 1978. 
The fact that I still agree with it and that therefore I have made no observ- 
able growth troubles me not - we all know about the unreliability of gain 
scores! 



REFERENCES



Andrew, B.J., & Hecht, J.T. A preliminary investigation of two procedures
for setting examination standards. Educational and Psychological
Measurement, 1976, 36, 35-50.

Berk, R.A. (Ed.) Criterion-Referenced Measurement: The state of the art.
Baltimore: Johns Hopkins University Press, 1980.

Block, J.H. Student learning and the setting of mastery performance standards.
Educational Horizons, 1972, 50, 183-190.

Brennan, R.L., & Lockwood, R.E. A comparison of the Nedelsky and Angoff
cutting score procedures using generalizability theory. Applied
Psychological Measurement, 1980, 4, 219-240.

Ebel, R.L. The case for minimum competency testing. Phi Delta
Kappan, 1978, 59, 8, 546-549.

Ebel, R.L. Essentials of educational measurement. Englewood Cliffs, NJ:
Prentice-Hall, 1972.

Glass, G.V. Minimum competence and incompetence in Florida. Phi Delta Kappan,
1978, 59, 602-605. (a)

Glass, G.V. Standards and criteria. Journal of Educational Measurement, 1978,
15, 237-261. (b)

Gorth, W.P., & Perkins, M.R. A Study of Minimum Competency Testing Programs:
Final Program Development Resource Document. Amherst, MA: National
Evaluation Systems, December 1979.

Hambleton, R.K., & Eignor, D.R. Competency test development, validation,
and standard setting. In R. Jaeger & C. Tittle (Eds.), Minimum competency
testing. Berkeley, CA: McCutchan, 1979.

Hofstee, W.K.B. Policies of educational selection and grading: The case for
compromise models. Paper presented at the Fourth International Symposium
on Educational Testing, Antwerp, Belgium, June 1980.

Huynh, H. Statistical consideration of mastery scores. Psychometrika, 1976,
41, 65-78.

Journal of Educational Measurement, 1978, 15, 4, 237-319.

Jaeger, R.M. A proposal for setting a standard on the North Carolina High School
Competency Test. Paper presented at the spring meeting of the North
Carolina Association for Research in Education, Chapel Hill, 1978.

Judges 12: 5-6. The Living Bible.

Kleinke, D.J. Applying the Angoff and Nedelsky techniques to the National Licensing
Examinations in Landscape Architecture. Paper presented at the annual meeting
of the National Council on Measurement in Education, Boston, April 1980.

Koffler, S.L. A comparison of approaches for setting proficiency standards.
Journal of Educational Measurement, 1980, 17, 167-168.



Livingston, S.A. A utility-based approach to the evaluation of pass/fail
testing decision procedures (Report No. COPA-75-01). Princeton, NJ:
Educational Testing Service, 1975.

Mehrens, W.A. The technology of competency measurement. In R.B. Ingle,
M.R. Carroll, & W.J. Gephart (Eds.), Assessment of Student Competence.
Bloomington, IN: Phi Delta Kappa, 1979.

Millman, J. Criterion-referenced measurement. In W.J. Popham (Ed.), Evaluation
in Education: Current applications. Berkeley, CA: McCutchan, 1974.

Nedelsky, L. Absolute grading standards for objective tests. Educational and
Psychological Measurement, 1954, 14, 3-19.

Novick, M.R., & Lindley, D.V. The use of more realistic utility functions in
educational applications. Journal of Educational Measurement, 1978, 15,
181-191.

Pipho, C. Minimum competency testing in 1978: A look at state
standards. Phi Delta Kappan, 1978, 59, 9, 585-597.

Popham, W.J. Normative data for criterion-referenced tests? Phi Delta Kappan,
1976, 58, 593-594.

Porter, J.W. News release. March 21, 1978.

Rentz, R.R. Discussion. Presented at the annual meeting of the National Council
on Measurement in Education, Boston, April 1980.

Shepard, L. Standard setting issues and methods. Applied Psychological Measurement,
1980, 4, 4, 447-467.

Skakun, E.N., & Kling, S. Comparability of methods for setting standards. Journal
of Educational Measurement, 1980, 17, 229-235.

van der Linden, W.J., & Mellenbergh, G.J. Optimal cutting scores using a linear
loss function. Applied Psychological Measurement, 1977, 1, 593-599.

Zieky, M.J., & Livingston, S.A. Manual for setting standards on the Basic Skills
Assessment Tests. Princeton, NJ: Educational Testing Service, 1977.


